This R Markdown document includes contributions by Professor Jessi Kehe.

Setup details

Introduction to Correlation

Lion Ages

  • Biologists are interested in examining the relationship between the age of lions and the proportion of their nose that is black.
  • Data are collected from a group of lions whose ages are known.
  • The hope is to develop a model to predict the age of lions with unknown age from something that can be measured from a distance with minimal interference from an image of the lion’s face.
  • Experts use multiple characteristics in addition to nose color, such as mane length, teeth wear, and facial scarring.

Data

lions = read_csv("../../data/lions.csv") %>% 
  rename(black = proportion.black)

Numerical Summary

  • Calculate:
    • the number of cases;
    • means and standard deviations of each variable;
    • the correlation (more on this below)
## first few rows
head(lions)
## # A tibble: 6 × 2
##     age black
##   <dbl> <dbl>
## 1   1.1  0.21
## 2   1.5  0.14
## 3   1.9  0.11
## 4   2.2  0.13
## 5   2.6  0.12
## 6   3.2  0.13
lions_sum = lions %>% 
  summarize(across(everything(), list(mean = mean, sd = sd)),
            n = n(),
            r = cor(age, black)) %>% 
  relocate(n)

lions_sum            
## # A tibble: 1 × 6
##       n age_mean age_sd black_mean black_sd     r
##   <int>    <dbl>  <dbl>      <dbl>    <dbl> <dbl>
## 1    32     4.31   2.68      0.322    0.199 0.790

Graphical Summary

  • We use a scatter plot to show the relationship between these variables
  • Add a simple linear regression line with geom_smooth().
ggplot(lions, aes(x = age, y = black)) +
  geom_point() +
  xlab("Age (years)") +
  ylab("Percentage Black") +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Ages and Lion Nose Color") +
  geom_smooth(se = FALSE, method = "lm") +
  theme_bw() +
  theme(text = element_text(size = 20)) 

  • We see a positive relationship between age and the proportion of the nose which is black.
    • As lions get older, the percentage of the nose that is black tends to increase
    • There is a fair amount of variability in this relationship, though
  • How can we measure the strength of a linear relationship?

The Correlation Coefficient r

  • The correlation coefficient r is a measure of the strength of a linear relationship between two quantitative variables.
  • The formula is

\[ r = \mathsf{Corr}(x,y) = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x} \right) \left(\frac{y_i - \bar{y}}{s_y} \right) \]

  • From the equation, note that:
    • \(\mathsf{Corr}(x,y) = \mathsf{Corr}(y,x)\)
      • The order of the variables does not matter
    • The individual values of \(x\) and \(y\) are standardized by subtracting the mean and dividing by the standard deviation (z scores)
    • \(r\) is (nearly) the average of the product of these standardized values
    • \(r\) is unitless
      • Any units in \(x\) and \(y\) are eliminated in standardization
      • Changing the scale of measurement of \(x\) and/or \(y\) does not change \(r\)
    • For cases where \(x\) and \(y\) are both greater than their mean or less than their mean, the sum gets bigger as the product is positive
    • For cases where one of \(x\) and \(y\) is greater than its mean and the other is less than it mean, the product is negative and the sum is made smaller
  • Not obvious from the equation, but provable with a mathematical argument is that \(-1 \le r \le 1\).
  • \(r = 1\) if and only if the points fall exactly on a line with a positive slope
  • \(r = -1\) if and only if the points fall exactly on a line with a negative slope

Alternative Formula

  • It is a very minor point, but we used a different estimate of the standard deviation, \(\hat{\sigma} = \sqrt{\sum_i(x_i - \bar{x})^2/n}\), which uses \(n\) instead of the \(n-1\) in the formula for \(s\), then we could say that \(r\) is exactly the mean of the product of the z-scores of the two variables.

\[ r = \mathsf{Corr}(x,y) = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{\hat{\sigma}_x} \right) \left(\frac{y_i - \bar{y}}{\hat{\sigma}_y} \right) \]

ggplot(lions, aes(x = age, y = black)) +
  geom_point() +
  xlab("Age (years)") +
  ylab("Percentage Black") +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Ages and Lion Nose Color",
          subtitle = "Red dashed lines at variable means") +
  geom_vline(xintercept = lions_sum$age_mean, color = "red", linetype = "dashed") +
  geom_hline(yintercept = lions_sum$black_mean, color = "red", linetype = "dashed") +
  theme_bw() +
  theme(text = element_text(size = 20)) 

  • Almost all points are in the upper right or lower left, so \(r > 0\).
  • The points are not on exactly straight line with a positive slope, so \(r < 1\).

Calculation

  • The built-in R function cor() calculate the correlation coefficient.
x = lions %>% pull(age)
y = lions %>% pull(black)
cor(x,y)
## [1] 0.7898272
  • Demonstrate with a manual calculation
lions_calc = lions %>% 
  mutate(z_age = (age - mean(age))/sd(age),
         z_black = (black - mean(black))/sd(black),
         prod = z_age * z_black)

lions_calc %>% 
  print(n = Inf)
## # A tibble: 32 × 5
##      age black    z_age z_black      prod
##    <dbl> <dbl>    <dbl>   <dbl>     <dbl>
##  1   1.1  0.21 -1.20    -0.565   0.677   
##  2   1.5  0.14 -1.05    -0.918   0.963   
##  3   1.9  0.11 -0.900   -1.07    0.962   
##  4   2.2  0.13 -0.788   -0.968   0.763   
##  5   2.6  0.12 -0.639   -1.02    0.650   
##  6   3.2  0.13 -0.414   -0.968   0.401   
##  7   3.2  0.12 -0.414   -1.02    0.422   
##  8   2.9  0.18 -0.527   -0.716   0.377   
##  9   2.4  0.23 -0.713   -0.464   0.331   
## 10   2.1  0.22 -0.825   -0.515   0.425   
## 11   1.9  0.2  -0.900   -0.615   0.554   
## 12   1.9  0.17 -0.900   -0.766   0.690   
## 13   1.9  0.15 -0.900   -0.867   0.781   
## 14   1.9  0.27 -0.900   -0.263   0.237   
## 15   2.8  0.26 -0.564   -0.313   0.177   
## 16   3.6  0.21 -0.265   -0.565   0.150   
## 17   4.3  0.3  -0.00350 -0.112   0.000391
## 18   3.8  0.42 -0.190    0.493  -0.0937  
## 19   4.2  0.43 -0.0409   0.543  -0.0222  
## 20   5.4  0.59  0.407    1.35    0.550   
## 21   5.8  0.6   0.557    1.40    0.779   
## 22   6    0.72  0.632    2.00    1.27    
## 23   3.4  0.29 -0.340   -0.162   0.0551  
## 24   4    0.1  -0.116   -1.12    0.129   
## 25   7.3  0.48  1.12     0.795   0.888   
## 26   7.3  0.44  1.12     0.593   0.663   
## 27   7.8  0.34  1.30     0.0897  0.117   
## 28   7.1  0.37  1.04     0.241   0.251   
## 29   7.1  0.34  1.04     0.0897  0.0935  
## 30  13.1  0.74  3.28     2.10    6.91    
## 31   8.8  0.79  1.68     2.36    3.95    
## 32   5.4  0.51  0.407    0.946   0.385
lions_calc %>% 
  summarize(r = sum(prod) / (n() - 1)) %>% 
  pull(r)
## [1] 0.7898272

Fake Data Examples

r near 0

  • There is no apparent pattern in the relationship between \(x\) and \(y\).
  • \(r\) is quite close to 0.

  • There is a deterministic non-linear relationship between \(x\) and \(y\)
  • \(r = 0\)

When \(r=0\), there is a weak linear relationship between \(x\) and \(y\). There may or may not be a strong non-linear relationship between \(x\) and \(y\).

r near 1

  • \(r\) is near 1
  • The points are tightly clustered around a line with a positive slope
  • It does not appear that a curve would fit the trend in the data substantially better.

  • \(r\) near one
  • The points are much more tightly clustered around the blue line with a positive slope than around the horizontal red dashed line
  • However, a curve would fit the data substantially better than a straight line
  • A non-linear relationship is superior to a simpler linear relationship

When \(r = 0.97\) there is a strong positive linear relationship between \(x\) and \(y\). There may or may not be an even stronger non-linear relationship between \(x\) and \(y\).

r near \(-1\)

  • \(r\) is near \(-1\)
  • The points are tightly clustered around a line with a negative slope
  • It does not appear that a curve would fit the trend in the data substantially better.

  • \(r\) is near \(-1\)
  • The points are more tightly clustered around the blue line with a negative slope than around the horizontal red dashed line.
  • However, a non-linear relationship is superior to a simpler linear relationship.

When \(r = -0.97\), there is a strong negative linear relationship between \(x\) and \(y\) There may or may not be an even stronger non-linear relationship between \(x\) and \(y\).

r near 0.7

  • \(r\) is moderate and positive
  • the points are more tightly clustered around the blue line with a positive slope than around the horizontal red dashed line
  • the strength of the linear relationship is moderate
  • a simpler linear relationship is an adequate summary of the data
    • there is no non-linear relationship that is a substantially better fit to the data

  • \(r\) is moderate and positive
  • the points are more tightly clustered around the blue line with a positive slope than around the horizontal red dashed line, but only by a smallish amount
  • the strength of the linear relationship is moderate
  • a non-linear linear relationship would fit the data better than a simple line

When \(r\) is near 0.7, there is a moderate positive linear relationship between \(x\) and \(y\). There may or may not be a nonlinear relationship which fits the data substantially better

Thought Quiz

  1. Why is there no distinction between the explanatory and response variables in correlation?
  2. Why do both variables have to be quantitative?
  3. How does changing the units of measurement change correlation?
  4. Why doesn’t a tight fit to a horizontal line imply strong correlation?

Correlation summary

  • \(r\) is a measure of the strength and direction of the linear relationship between two quantitative variables
  • The value of \(r\) alone says nothing about the strength of any potential non-linear relationships
  • Graph your data!

Introduction to Regression

Background

  • Many problems in Statistics and Data Science involve examining the relationship between variables.
  • A common setting is when:
    • a single quantitative variable is deemed a response variable
    • one or more other variables are treated as explanatory variables or predictors.
  • Such models are referred to as linear models
    • In a standard linear model, the response variable is assumed to have a normal distribution given the predictors.

Special cases

  • Linear models can be categorized based on the number and types of the predictors.
    • simple linear regression: a single quantitative predictor
    • multiple regression: multiple quantitative predictors
    • one-way analysis of variance: a single categorical predictor
    • analysis of variance: multiple categorical predictors

Generalizations

  • When the distribution of the response variable is not modeled with a normal distribution, such as when it is a count or a single 0/1 response, it is often preferable to use a generalized linear model.
    • logistic regression: for 0/1 response variables;
    • Poisson regression: for non-negative count data.
  • At other times, it is preferable to model a categorical predictor as having a distribution itself. Some of these models are called mixed effects models.

Summary of the regression line

Correlation tells us about the strength and direction of a line.

  • What if we want a description of how the variables vary together?

The regression line…

  • A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes
  • We can use a regression line to predict the value of y for a given value of x
  • The distinction between explanatory and response variables is important
  • The Least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is as small as possible.

Example: Age and Height

  • The following graph shows a plot of height in inches versus age in months of a boy from age 2 years to 8 years.
height = read_table("../../data/height.txt")
df = height %>% 
  filter(age >=2*12 & age <=8*12)
ggplot(df, aes(x = age, y = height)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  ylab("Height (inches)") +
  xlab("Age (months)")

Regression Model

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

  • \(Y_i\) is the response variable
  • \(X_i\) is the exoplantory variable
  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the slope
  • \(\varepsilon_i\) is the random “error” between the actual value \(Y_i\) and the mean value \(\beta_0 + \beta_1 X_i\).
  • A common method to estimate \(\beta_0\) and \(\beta_1\) from data \(x\) and \(y\) is the method of “least squares”
    • This line minimizes the sum of squared vertical distances between the points and a line
    • Find values \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize

\[ \sum_{i=1}^n \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2 \]

  • There is a multivariate calculus solution to this problem, but we explore a different way to do it.

lm()

  • We can use lm() to fit the linear regression model
df_lm = lm(height ~ age, data = df)
cf = coef(df_lm)
cf
## (Intercept)         age 
##  30.2493290   0.2505185
summary(df_lm)
## 
## Call:
## lm(formula = height ~ age, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29911 -0.19291 -0.03355  0.21982  0.46334 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.249329   0.186745  161.98   <2e-16 ***
## age          0.250519   0.002869   87.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2429 on 14 degrees of freedom
## Multiple R-squared:  0.9982, Adjusted R-squared:  0.998 
## F-statistic:  7627 on 1 and 14 DF,  p-value: < 2.2e-16
  • The coefficients are the slope, \(\beta_1 = 0.251\), and the intercept \(\beta_0 = 30.249\).
    • The slope has units inches per month (y over x)
    • The intercept has units inches (y)
  • For this data set, \(x=0\) is outside the range of the data
    • The value at \(x=0\) has a meaningful interpretation, the height at birth
    • However, this interpretation requires an extrapolation beyond the range of the data which assumes that the linear relationship continues

Coefficients and Summary Statistics

  • The least squares coefficients have simple equations based on summary statistics of the data.

\[ \hat{\beta}_1 = r \times \frac{s_y}{s_x} \]

\[ \hat{\beta}_0 = \bar{y} - b_1 \bar{x} \]

  • Note that the regression line goes through the point \((\bar{x},\bar{y})\).
x = df$age
y = df$height

xbar = mean(x)
ybar = mean(y)
sx = sd(x)
sy = sd(y)
r = cor(x,y)

c(xbar, ybar, sx, sy, r)
## [1] 61.5625000 45.6718750 21.8661649  5.4829043  0.9990835
b1 = r *sy/sx
b0 = ybar - b1*xbar

c(b0, b1)
## [1] 30.2493290  0.2505185
cf
## (Intercept)         age 
##  30.2493290   0.2505185

Estimated model

Using observations \((x_1,y_1), \ldots, (x_n,y_n)\), we can estimate \(\beta_0\) and \(\beta_1\) to get our Least-squares regression line as

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \] where \(\hat{y}_i\) is the predicted response corresponding to \(x_i\), and \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the estimated intercept and slope, respectively.

Least-squares Estimation: minimize the squared errors with respect to the unknown parameters.
- This is accomplished by minimizing the following with respect to the unknown parameters:
\[ \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 \] - This results in the estimators noted above for \(\beta_0\) and \(\beta_1\).

Predicted values

  • The straightforward way to make a prediction is just to plug into the regression equation with the estimated coefficients.

  • Estimate the boy’s height at age 50 months.

## one way
est = b0 + 50*b1
est
## [1] 42.77525
## using coef()
sum(cf * c(1,50))
## [1] 42.77525
  • The regression line crosses has the value \(\hat{y} = 42.775\) at \(x=50\) months.
## Visualization of this prediction
ggplot(df, aes(x = age, y = height)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  geom_point(aes(x=50, y=b0 + 50*b1), color="red", size=4) + 
  geom_vline(xintercept = 50, color="red", linetype="dashed") +
  ylab("Height (inches)") +
  xlab("Age (months)")

Understand by standard units

  • We can also understand this with standard units.

  • The value \(x=50\) is \(z = (50 - \bar{x})/s_x)\) standard deviations from the mean.

x0 = 50
z = (x0 - mean(x))/sd(x)
z
## [1] -0.528785
  • We predict that \(y\) will be \(rz\) standard deviations from \(\bar{y}\).
    • Here, \(r\) is very close to one, so we predict that \(y\) is nearly the same number of standard deviations from \(\bar{y}\) as \(x\) was from \(\bar{x}\).
yhat = mean(y) + r*z*sd(y)
yhat
## [1] 42.77525

Lion Data

  • Let’s try this with the lions data
lions = read_csv("../../data/lions.csv") %>% 
  rename(black = proportion.black)

ggplot(lions, aes(x = age, y = black)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  xlab("Age (years)") +
  ylab("Proportion of nose that is black")

lions_lm = lm(black ~ age, data = lions)
lions_cf = coef(lions_lm)
lions_cf
## (Intercept)         age 
##  0.06969626  0.05859115
summary(lions_lm)
## 
## Call:
## lm(formula = black ~ age, data = lions)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20406 -0.07758 -0.01750  0.07913  0.29876 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.069696   0.041956   1.661    0.107    
## age         0.058591   0.008307   7.053 7.68e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1238 on 30 degrees of freedom
## Multiple R-squared:  0.6238, Adjusted R-squared:  0.6113 
## F-statistic: 49.75 on 1 and 30 DF,  p-value: 7.677e-08

Summary Statistics

lions_sum = lions %>% 
  summarize(across(everything(), list(mean = mean, sd = sd)),
            n = n(),
            r = cor(age, black)) %>% 
  relocate(n)

lions_sum = lions_sum %>% 
  mutate(b1 = r*black_sd/age_sd,
         b0 = black_mean - b1*age_mean)

lions_sum %>% 
  print(width = Inf)
## # A tibble: 1 × 8
##       n age_mean age_sd black_mean black_sd     r     b1     b0
##   <int>    <dbl>  <dbl>      <dbl>    <dbl> <dbl>  <dbl>  <dbl>
## 1    32     4.31   2.68      0.322    0.199 0.790 0.0586 0.0697
  • Predict at age 10 years
yhat_1 = sum( c(1,10) * lions_cf )
yhat_1
## [1] 0.6556078

Standard Units

x0 = 10
z = (x0 - lions_sum$age_mean)/lions_sum$age_sd
z
## [1] 2.126077
  • 10 years is 2.126 standard deviations above the mean
lions_sum$r*z
## [1] 1.679234
  • Predict the proportion black of the nose to be \(rz = 1.68\) standard deviations above the mean.
yhat_2 = lions_sum$black_mean + lions_sum$r*z*lions_sum$black_sd
yhat_2
## [1] 0.6556078

Regression Example with fake data

We are going to generate a fake data set to use for our example. By using fake data, we can also check how close our regression estimates match the true values we used to simulate our data.

## Generate our fake data set
set.seed(246810)
n = 100 ## sample size
b0 = 1  ## intercept
b1 = 2.5  ## slope
x = runif(n, -3, 10)  ## explanatory variable
y = rnorm(n,b0+b1*x,3)  ## response variable

tibble(x,y) %>% 
  ggplot(aes(x = x, y = y)) +
  geom_point()

## Estimate our slope and intercept
mx = mean(x)
my = mean(y)
sx = sd(x)
sy = sd(y)

r = cor(x,y) # correlation coef
r
## [1] 0.9567482
cor(x,y)
## [1] 0.9567482
cor(y,x)
## [1] 0.9567482
b1_hat = r *sy/sx
b1_hat
## [1] 2.462811
b0_hat = my - b1_hat*mx
b0_hat
## [1] 0.8643708

We also can use lm() to estimate the slope and intercept:

df0 = tibble(x=x, y=y)
lm0 = lm(y~x, df0)
summary(lm0)
## 
## Call:
## lm(formula = y ~ x, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5064 -2.0083  0.2062  1.9766  7.2757 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.86437    0.39085   2.211   0.0293 *  
## x            2.46281    0.07565  32.557   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.951 on 98 degrees of freedom
## Multiple R-squared:  0.9154, Adjusted R-squared:  0.9145 
## F-statistic:  1060 on 1 and 98 DF,  p-value: < 2.2e-16
cf = coef(lm0)
cf
## (Intercept)           x 
##   0.8643708   2.4628111
c(b0, b1)
## [1] 1.0 2.5

Residual plots

One approach for checking if our linear assumption holds is to create a residual plot. Recall that residual corresponding to the \(i\)th observation is \(r_i = y_i - \hat{y}_i\).

library(modelr)
df0 = df0 %>%
  add_residuals(lm0) %>%
  add_predictions(lm0)

ggplot(df0, aes(x=x, y =resid)) +
  geom_point() +
  xlab("x") +
  ylab("Residuals") +
  geom_hline(aes(yintercept=0), color="red", linetype = "dashed")

As a comparison, below is a residual plot for a different simulated data set.

Do you see any pattern in this residual plot?

There is a clear pattern. The residuals near the boundaries are positive and the residuals near the middle of the x range are negative.

This suggests that the relationship between y and x is not linear and appears to be quadratic.

Overview of Exoplanet Mass-Radius Relationship

In previous lectures, we explored data from the NASA Exoplanet Archive. Two variables that are useful for characterizing an exoplanet are mass and radius.

The two most prolific methods for detecting exoplanets are the Transit Method and the Radial Velocity Method.

Data

  • Read in the exoplanet data, skipping the rows with meta data
  • Select a subset of variables and rename them
## Read in the csv file
## Select confirmed planets, rename some variables
planets = read_csv("../../data/exoplanets-clean-through-2022.csv")

Mass-Radius Relationship

The mass-radius relationship of exoplanets is the relationship between exoplanets’ radius, R, their mass, M.

Modeling the relationship between mass and radius is important for the following reasons:

  • Prediction
    • The model can be used to predict a planet’s mass given its radius measurement
    • We have more observations with radius estimates than mass estimates, so having a way to estimate mass can be useful
## Number of mass and radius estimates
planets %>%
  select(mass, radius) %>%
  summarize_all(function(x) sum(!is.na(x)))
## # A tibble: 1 × 2
##    mass radius
##   <int>  <int>
## 1  1397   3973
  • Learning about exoplanets compositions
    • Planet compositions can be inferred by their density
    • Exoplanets can have a range of compositions such as rocky or gaseous
    • Knowing about planet composition may help to understand planetary formation and evolution processes
    • For more information about planet compositions, see exoplanet compositions

Plot Mass vs. Radius - quick look

Let’s take a quick look at what the Mass-Radius relationship looks like for our exoplanet data. We’ll talk later about a way to model these data.

## How many observations do we have with both mass and radius estimates?
planets %>%
  select(mass, radius) %>%
  drop_na() %>%
  nrow()
## [1] 1022
ggplot(planets, aes(radius, mass)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)")

It’s hard to see any clear pattern in this plot. Let’s adjust the axis scales to see if that helps.

ggplot(planets, aes(radius, mass)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(se=FALSE) +
  geom_smooth(method="lm", se=FALSE, color="magenta")

Now we can see a bit more of a relationship between mass and radius on this log10 scale.

The general pattern is that there is a positive association between log10(radius) and log10(mass).

Power-Law Relationship

One of the popular models used for the Mass-Radius relationship is a power-law relationship:

\[ y = C \times x^\theta \] where \(y\) is the response variable, \(x\) is the explanatory variable, \(C\) is a scaling factor, and \(\theta\) is the power law coefficient.

Examples of power laws

power_law = function(theta){
  df = tibble(x = seq(0, 10, by = .1),
               y =x^theta)
  gg = ggplot(df, aes(x,y)) +
    geom_line() +
    ggtitle(paste0("Power law exponent: ", theta)) +
    theme_bw() +
    theme(text=element_text(size=20))
  return(gg)
}

power_law(1)

power_law(.5)

power_law(2)

Power-law model for Mass-Radius relationship

While we will consider mass on the y-axis and radius on the x-axis, astronomers will often plot and model these the other way (with radius on the y-axis).

  • Since the mass measurements tend to be harder to obtain, we can look at our mass vs. radius model as useful for using an estimated radius to predict an unknown mass.

Astronomers have found that the power law relationship between mass and radius is not constant across the range of values, but instead there seems to be a different power law exponent for different mass ranges.

  • This results in what is known as a broken power law model where different ranges of the data have different power law parameters.
  • We will only consider a subset of the data where the power law exponent is thought to be constant. We define this subset below.
  • The range for masses we will consider is between 2 and 127 Earth masses. This range comes from work by Jingjing Chen and David Kipping in 2016 where they took a data-driven approach to detect change points in the broken power law model.
    • Chen, J. and Kipping, D., 2016. Probabilistic forecasting of the masses and radii of other worlds. The Astrophysical Journal, 834(1), p.17.
    • In their model, they considered radius vs. mass, but found the noted mass range to be consistent with previous work looking for change points in radius.

Below we define the data subset we will use for our model and plot mass vs. radius.

mr = planets %>%
  filter(between(mass, 2, 127)) %>%
  drop_na()

ggplot(mr, aes(radius, mass)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(se=FALSE) +
  geom_smooth(method="lm", se=FALSE, color="magenta")

Fitting the Power Law Model

We already saw the form of the power law relationship is \(y = C \times x^\theta\). Now we’d like to turn this into a statistical model that we can use for the exoplanet data.

The statistical model we will actually fit is going to use a log10 transformation of the power law relationship in order to make the form linear: \[ \begin{align*} y_i = \log10(m_i) & = \log10(C) + \theta\log10(r_i) + \varepsilon_i \\ & = \log10(C) + \theta x_i + \varepsilon_i, i = 1, \ldots, n. \end{align*} \] In this model, the response variable \(y_i = \log10(m_i)\) is the \(\log10\)(mass) for exoplanet \(i\), the explanatory variable \(x_i = \log10(r_i)\) is the \(\log10\)(radius) for exoplanet \(i\), \(\log10(C)\) is the (unknown) intercept, \(\theta\) is the (unknown) slope, and \(\varepsilon_i\) is the random error for exoplanet \(i\).

lm1 = lm(log10(mass) ~ log10(radius), data = mr)
summary(lm1)
## 
## Call:
## lm(formula = log10(mass) ~ log10(radius), data = mr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21927 -0.19937 -0.02025  0.17505  1.71694 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.60223    0.02860   21.06   <2e-16 ***
## log10(radius)  1.10338    0.04724   23.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3299 on 461 degrees of freedom
## Multiple R-squared:  0.542,  Adjusted R-squared:  0.541 
## F-statistic: 545.5 on 1 and 461 DF,  p-value: < 2.2e-16

The estimated intercept is 0.602 and the estimated slope is 1.103.

Let’s check out our fit model on a plot:

ggplot(mr, aes(radius, mass)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  scale_x_log10() +
  scale_y_log10() +
  geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue")

Let’s also see what this looks like on the original scale.

mr %>%
  mutate(mass_pred = 10^coef(lm1)[1]*radius^coef(lm1)[2]) %>%
ggplot(aes(radius, mass)) +
  geom_point() +
  geom_line(aes(y = mass_pred), color="red") +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)")

Recall that the estimated slope is the power law exponent. Since it is somewhat near to 1, this suggests that the data on the original scale are not too far from linear in radius.

mr %>%
  mutate(mass_pred = 10^coef(lm1)[1]*radius^coef(lm1)[2]) %>%
ggplot(aes(radius, mass)) +
  geom_point() +
  geom_line(aes(y = mass_pred), color="red") +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  geom_smooth(method="lm", se=FALSE, color="blue")

summary(lm(mass ~ radius, data=mr))
## 
## Call:
## lm(formula = mass ~ radius, data = mr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.461  -8.857  -3.793   3.528 114.188 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.7461     1.4677  -1.871    0.062 .  
## radius        6.9904     0.2579  27.100   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.4 on 461 degrees of freedom
## Multiple R-squared:  0.6144, Adjusted R-squared:  0.6135 
## F-statistic: 734.4 on 1 and 461 DF,  p-value: < 2.2e-16
#Back to slides

These models are quite different. We will do some model checking on the appropriateness of our linear model on the log10 scale next.

Model checking

The linear model we fit uses least squares regression.

library(modelr)
mr = mr %>%
  add_residuals(lm1) %>%
  add_predictions(lm1)

ggplot(mr, aes(x=radius, y=mass)) +
  geom_point() +
  geom_segment(aes(xend = radius, yend = 10^pred), color="magenta") +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  scale_x_log10() +
  scale_y_log10() +
  geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue") +
    theme_bw() +
    theme(text=element_text(size=20))

The vertical magenta bars are the residuals for the model defined as \(r_i = y_i - \hat{y}_i\), where \(\hat{y}_i\) is the predicted \(\log10\)(mass) from our model for exoplanet \(i\).

Residual plot

Next we are going to consider a residual plot. This is where we remove the fit model from the data, and only plot the errors (the residuals) against radius.

ggplot(mr, aes(x=radius, y=resid)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Residual (Earth Mass)") +
  scale_x_log10() +
  geom_hline(aes(yintercept = 0), color = "red", linetype = "dashed") +
  theme_bw() +
  theme(text=element_text(size=20)) +
  geom_smooth(se=FALSE)

Patterns in a residual plot can suggest that our linear model model may not be appropriate for the data.

  • In this case, you may notice that the residuals corresponding to smaller radius values tend to be positive, and there seems to be a little bit of clustering of points (e.g., above 10 Earth Radius).
  • But, overall, the model form seems reasonable.

Introduction to inference on linear models

Assumptions on the random errors

Let’s go back to our linear model, \[ y_i = \log10(C) + \theta x_i + \varepsilon_i, i = 1, \ldots, n. \]

We only mentioned that the \(\varepsilon_i\)’s are random errors, but we did not discuss other assumptions.

It is common to make the following assumptions (which should also be checked):

  • \(E(\varepsilon_i) = 0\).
    • This implies \(E(y_i) = \log10(C) + \theta x_i\).
  • Errors have a constant variance: Var\((\varepsilon_i) = \sigma^2\).
    • This implies Var\((y_i) = \sigma^2\).
  • The errors are uncorrelated.

When desiring to do inference on the estimated parameters, another common assumption to make is that the errors are normally distributed, \(\varepsilon_i \sim N(0, \sigma)\).

  • This implies that \(y_i \sim N(\log10(C) + \theta x_i, \sigma)\).

This normality assumption on the errors has implications for the estimates of our parameters. In particular, it has the consequence that the estimators of our intercept and slope are normally distributed.

Recall: Z-scores and t-scores

In a previous discussion assignment you were introduced to Z-scores and t-scores. The t-scores are going to show in our regression inference, so we review that information here.

Note that in this review, the estimator is the sample mean. When we get back to regression the estimators will be for the slope or intercept, which will result in some changes in the t-statistic and the degrees of freedom of the resulting t-distribution.

Assume a model where \(X_1,\ldots,X_n\) are drawn at random from a distribution with a mean \(\mu\) and a standard deviation \(\sigma\).

The sampling distribution of the sample mean, \(\bar{X} = n^{-1}\sum_{i=1}^n X_i\) has a mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\).

A mathematical derivation is required to show this formally, or simulation can be used to check if it the expressions are plausible.

Z-Score

We have seen in many settings that the z-statistic (substract the mean, divide by the standard deviation) often has an approximate standard normal distribution. \[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \] If the distribution of each \(X_i\) is normal, then \(Z\) will be normal as well.

Even if \(X_i\) does not have a normal distribution, the distribution of \(\bar{X}\) will be approximately normal if \(n\) is large enough to overcome the non-normality, a result known as the central limit theorem.

t distribution

However, \(\sigma\) is typically unknown and things are a bit different when the sample standard deviation \(s\) is substituted for \(\sigma\). \[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \] where \[ S = \sqrt{ \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1} } \]

The added random variability in the denominator means that even when the distribution of a single random variable is exactly normal, the distribution of \(T\) is not.

Instead, it has a \(t\) distribution with \(n-1\) degrees of freedom, which is a bell-shaped density centered at zero, but more spread out than a standard normal density.

When the degrees of freedom becomes large, the \(t\) distribution is quite close to standard normal.

It is identical to standard normal when the degrees of freedom is infinite.

R functions

These R functions are similar to their normal counterparts.

  • rt(): generate random variables from a t distribution
  • pt(): find an area under a t density
  • qt(): find a quantile from a t density
  • dt(): return the height of the t density

In addition, the following functions are available in the script ggprob.R to add t densities to plots.

  • geom_t_density(): add a t density to a plot
  • geom_t_fill(): add a filled t density
  • gt(): graph a t density

The following graph shows a standard normal distribution in black and t distributions with degrees of freedom equal to 1, 2, 4, 8, , 1028 ranging from yellow to violet.

Confidence Intervals

A confidence interval for \(\mu\) has the form \[ \bar{x} \pm t^* \frac{s}{\sqrt{n}} \] where \(t^*\) is selected so that the area between \(-t^*\) and \(t^*\) under a t density with \(n-1\) degrees of freedom is the desired confidence level.

A confidence interval for a difference between means, \(\mu_1 - \mu_2\), has the form \[ \bar{x}_1 - \bar{x}_2 \pm t^* \sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} } \] where \(t^*\) is from a \(t\) distribution where the degrees of freedom is estimated as a function of the sample sizes and standard deviations. Use the function t.test(). This approach assumes that the standard deviations of the two populations need not be the same.

Hypothesis Tests

When using t distribution methods, p-values are found by calculating the t statistic and finding areas under t distributions.

Back to regression

These concepts will be used to carry out inference on the estimated parameters of our simple linear model, which we will do next.

Inference for linear models

If we make the assumption that the errors on linear model are normally distributed, we can carryout hypothesis test on our estimated parameters.

We will focus on inference for the slope parameter since that tends to be the more scientifically interesting parameter.

The hypothesis test we will carry out is \[ H_0: \theta = 1 \\ H_a: \theta \neq 1 \]

We test the null that our slope parameter \(\theta\) (which is the power law exponent) is 1, suggesting a linear relationship between \(\log10\)(mass) and \(\log10\)(radius), against the alternative that there is a power law relationship.

This leads to the test statistic \[ T = \frac{\hat{\theta} - 1}{s_{\hat{\theta}}} \] where \(s_{\hat{\theta}}\) is the standard error of \(\hat{\theta}\). Note that the 1 is from the null hypothesis assumption that \(\theta = 1\).

n = nrow(mr)
syy = sum(mr$resid^2)
sxx = sum((log10(mr$radius) - mean(log10(mr$radius)))^2)
sqrt(syy/(n-2)/sxx)  ## standard error using the formula above
## [1] 0.04724113
coef(summary(lm1))[2, "Std. Error"] ## standard error from our lm1 model
## [1] 0.04724113

This is also the standard error that would be used in a confidence interval for \(\theta\): \(\hat{\theta} \pm t_{n-2} s_{\hat{\theta}}\), where \(t_{n-2}\) is selected based on the desired confidence level.

The output of our estimated model gives us the result of (a different) hypothesis test: \[ H_0: \theta = 0 \\ H_a: \theta \neq 0 \]

summary(lm1)
## 
## Call:
## lm(formula = log10(mass) ~ log10(radius), data = mr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21927 -0.19937 -0.02025  0.17505  1.71694 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.60223    0.02860   21.06   <2e-16 ***
## log10(radius)  1.10338    0.04724   23.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3299 on 461 degrees of freedom
## Multiple R-squared:  0.542,  Adjusted R-squared:  0.541 
## F-statistic: 545.5 on 1 and 461 DF,  p-value: < 2.2e-16

The estimated slope is \(\hat{\theta} = 1.12339\) with an estimated standard error of \(SE(\hat{\theta})=0.05088\).

This leads to a t-statistic of \[ T = \frac{1.12339 - 0}{0.05088} = 22.08 \]

We see that there are 368 observations in our data set. Since there are two estimated parameters in our linear model (slope and intercept), we use 368-2 = 366 degrees of freedom for the t-distribution for our slope.

## Number of observations
mr %>%
  nrow()
## [1] 463
## Compute our p-value
pt(22.08, df=366, lower.tail=FALSE)*2 ## P(T >= 22.08) x 2
## [1] 2.785903e-69

Notice that this p-value is very small leading us to reject the null hypothesis that the slope is zero.

To carry out our hypothesis test with \[ H_0: \theta = 1 \\ H_a: \theta \neq 1 \] we use the following t-statistic: \[ T = \frac{1.12339 - 1}{0.05088} = 2.425118 \]

## t-statistic
tstat = (coef(summary(lm1))[2]-1)/coef(summary(lm1))[2, "Std. Error"]
tstat
## [1] 2.188302
## Compute our p-value
pt(tstat, df=nrow(mr)-2, lower.tail=FALSE)*2 ## P(T >= 2.425118) x 2
## [1] 0.02914877

Since the p-value is less than 0.05, there is evidence suggesting that we can reject the null hypothesis. In other words, the results suggest that a linear model is not appropriate, in favor of a power law exponent greater than 1 (p-value=0.016, two-sided t-test).

We can create a plot to visualize the p-value:

gt(nrow(mr)-2) +
  geom_vline(aes(xintercept = c(-abs(tstat), tstat)), color="red", linetype="dashed") +
  geom_t_fill(nrow(mr)-2, a = abs(tstat), b=6) +
  geom_t_fill(nrow(mr)-2, a=-6, b = -abs(tstat))

Inference for linear models via simulation

Instead of relying on the theoretical values for the standard deviation of the parameters, we could instead run a simulation.

We can use the parametric bootstrap, which involves generating many realizations of the data using the initial estimate for \(\theta\) and then fitting the regression model on the simulated data set to obtain many estimates of \(\theta\). The mean and standard deviation of these values can be used for inference.

Here are the steps for the parametric bootstrap:

  1. Estimate the parameters of the statistical model.
  2. Use the estimated model and simulate a new data set of the same size.
    • In a regression framework, this would involve adding to each predicted value \(y^*\) a new simulated normally distributed random error from a distribution with mean zero and the estimated standard deviation.
  3. Fit a linear model to the simulated data set and estimate the coefficients.
  4. Repeat steps 2 and 3 many times.
  5. Calculate the standard deviations of the coefficients from the simulated data sets as estimates of the standard errors.
n = mr %>%
  select(mass, radius) %>%
  drop_na() %>%
  nrow()
sigma_resid = sd(mr$resid)
N = 100 # 10000
b_0 = numeric(N)
b_1 = numeric(N)

for ( i in 1:N )
{
  mr_new = mr %>%
    drop_na() %>%
    mutate(mass = 10^(pred + rnorm(n,0,sigma_resid))) # these are the "new" y's
  lm2 = lm(log10(mass) ~ log10(radius), data=mr_new)
  b_0[i] = coef(lm2)[1]
  b_1[i] = coef(lm2)[2]
}

df_coef = tibble(b_0,b_1)

Let’s look at the slope parameter. We can estimate the mean and standard deviation from the parametric bootstrap simulation, and then compare it to a normal distribution with the same mean and standard deviation.

mean_slope = mean(df_coef$b_1)
mean_slope
## [1] 1.110855
sd_slope = sd(df_coef$b_1)
sd_slope
## [1] 0.04822574
ggplot(df_coef, aes(x=b_1)) +
  geom_density() +
  xlab("theta") +
  ylab("Density") +
  geom_norm_density(mu = mean_slope, sigma = sd_slope, color="blue") +
  ggtitle("Exoplanet Mass-Radius Relationship",
          subtitle = "Parametric bootstrap distribution of the slope (black), and a normal density (blue)")

The simulated distribution (black curve) follows the normal distribution (blue curve) quite well.

So how do these values compare to what we calculated with the theoretical equations?

paste("Bootstrap: ", round(mean_slope,3), round(sd_slope,3))
## [1] "Bootstrap:  1.111 0.048"
paste("Theoretical: ", round(coef(lm1)[2],3), round(coef(summary(lm1))[2, "Std. Error"],3))
## [1] "Theoretical:  1.103 0.047"

Very close as well!

Prediction

Recall: Mass-Radius Relationship

The mass-radius relationship of exoplanets is the relationship between exoplanets’ radius, R, their mass, M.

Modeling the relationship between mass and radius is important for the following reasons:

  • Prediction
    • The model can be used to predict a planet’s mass given its radius measurement
    • We have more observations with radius estimates than mass estimates, so having a way to estimate mass can be useful
## Number of mass and radius estimates
planets %>%
  select(mass, radius) %>%
  summarize_all(function(x) sum(!is.na(x)))
## # A tibble: 1 × 2
##    mass radius
##   <int>  <int>
## 1  1397   3973
  • Learning about exoplanets compositions
    • Planet compositions can be inferred by their density
    • Exoplanets can have a range of compositions such as rocky or gaseous
    • Knowing about planet composition may help to understand planetary formation and evolution processes
    • For more information about planet compositions, see exoplanet compositions

Plot Mass vs. Radius - quick look

Let’s take a quick look at what the Mass-Radius relationship looks like for our exoplanet data. We’ll talk later about a way to model these data.

ggplot(planets, aes(radius, mass)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(se=FALSE) +
  geom_smooth(method="lm", se=FALSE, color="magenta")

The general pattern is that there is a positive association between log10(radius) and log10(mass).

Recall: Power-law model for Mass-Radius relationship

While we will consider mass on the y-axis and radius on the x-axis, astronomers will often plot and model these the other way (with radius on the y-axis).

  • Since the mass measurements tend to be harder to obtain, we can look at our mass vs. radius model as useful for using an estimated radius to predict an unknown mass.

Astronomers have found that the power law relationship between mass and radius is not constant across the range of values, but instead there seems to be a different power law exponent for different mass ranges.

  • This results in what is known as a broken power law model where different ranges of the data have different power law parameters.
  • We will only consider a subset of the data where the power law exoponent is thought to be constant. We define this subset below.
  • The range for masses we will consider is between 2 and 127 Earth masses. This range comes from work by Jingjing Chen and David Kipping in 2016 where they took a data-driven approach to detect change points in the broken power law model.
    • Chen, J. and Kipping, D., 2016. Probabilistic forecasting of the masses and radii of other worlds. The Astrophysical Journal, 834(1), p.17.
    • In their model, they considered radius vs. mass, but found the noted mass range to be consistent with previous work looking for change points in radius.

Below we define the data subset we will use for our model and plot mass vs. radius.

mr = planets %>%
  filter(between(mass, 2, 127)) %>%
  drop_na()

ggplot(mr, aes(radius, mass)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(se=FALSE) +
  geom_smooth(method="lm", se=FALSE, color="magenta")

Recall: Power law model

The form of the power law relationship is \(y = C \times x^\theta\).

The statistical model we will actually fit is going to use a log10 transformation of the power law relationship in order to make the form linear: \[ \begin{align*} y_i = \log10(m_i) & = \log10(C) + \theta\log10(r_i) + \varepsilon_i \\ & = \log10(C) + \theta x_i + \varepsilon_i, i = 1, \ldots, n. \end{align*} \] In this model, the response variable \(y_i = \log10(m_i)\) is the \(\log10\)(mass) for exoplanet \(i\), the explanatory variable \(x_i = \log10(r_i)\) is the \(\log10\)(radius) for exoplanet \(i\), \(\log10(C)\) is the (unknown) intercept, \(\theta\) is the (unknown) slope, and \(\varepsilon_i\) is the random error for exoplanet \(i\).

  • Fit the model:
lm1 = lm(log10(mass) ~ log10(radius), data = mr)
summary(lm1)
## 
## Call:
## lm(formula = log10(mass) ~ log10(radius), data = mr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21927 -0.19937 -0.02025  0.17505  1.71694 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.60223    0.02860   21.06   <2e-16 ***
## log10(radius)  1.10338    0.04724   23.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3299 on 461 degrees of freedom
## Multiple R-squared:  0.542,  Adjusted R-squared:  0.541 
## F-statistic: 545.5 on 1 and 461 DF,  p-value: < 2.2e-16
## Add residuals and predicted values to our data frame 
mr = mr %>%
  add_residuals(lm1) %>%
  add_predictions(lm1)

The estimated intercept is 0.602 and the estimated slope is 1.103.

Prediction

Let’s take a look at our estimated model on a plot again:

ggplot(mr, aes(radius, mass)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  scale_x_log10() +
  scale_y_log10() +
  geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue") +
  theme(text = element_text(size=20))

The estimated regression model is \[ \hat{y}_i = 0.602 + 1.103 x_i. \]

We can use the estimated linear model to predict a mass for a given radius.

Note that the function below assumes the input x is radius on the original scale and returns an estimated mass on the original scale.

predict_y = function(x){
    ## x = radius (on original scale)
    slope = coef(lm1)[2]
    intercept = coef(lm1)[1]
    logy = intercept + slope*log10(x)
    y = 10^logy
    names(y) = "predicted mass"
    return(y)
}

radius_input = 3
mass_predicted = predict_y(3)
mass_predicted
## predicted mass 
##       13.44852

We can plot this point as well:

ggplot(mr, aes(radius, mass)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  scale_x_log10() +
  scale_y_log10() +
  geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue") +
  geom_point(aes(x = radius_input, y=mass_predicted), color = "red", size = 2) +
  theme(text = element_text(size=20))

An interesting and useful feature of the least squares regression line is that it goes through the point \((\bar{x}, \bar{y})\). We can check this with our function as well.

  • This property holds on the scale of the linear model.
  • Since our \(x\) and \(y\) variables were transformed, we have to make some adjustments to the scale.
radius = 10^mean(log10(mr$radius)) ## mean of log10(radius), transformed back to original scale for function
radius
## [1] 3.244174
log10(predict_y(radius)) ## predicted value at xbar
## predicted mass 
##       1.166171
print(paste0("(xbar,ybar) = (",round(mean(log10(mr$radius)),3), ", ", round(mean(log10(mr$mass)),3),")"))
## [1] "(xbar,ybar) = (0.511, 1.166)"

We can get our predicted values of mass, but there is uncertainty in this estimate, and it can be desirable to define an interval around the estimate to capture this uncertainty.

There are two common ways to look at this problem of predicting \(\hat{y}\):

  • Confidence interval
    • The goal is an uncertainty interval around the parameter \(E(y \mid x^*)\), the expected value of the response given the explanatory variable \(x^*\).
  • Prediction interval
    • The goal is an uncertainty interval around some future \(y^*\) for some given \(x^*\).
    • The prediction interval is trying to capture a random outcome \(y^*\) rather than a fixed parameter like the \(E(y \mid x^*)\).

We will focus on the confidence interval version next.

Confidence intervals for \(E(y \mid x^*)\)

The goal is an uncertainty interval around the parameter \(E(y \mid x^*)\), the expected value of the response given the explanatory variable \(x^*\).

The uncertainty in the estimated \(\hat{y}\) needs to account for the uncertainty in the estimated intercept and estimated slope. The formula for this uncertainty is \[ s_{\hat{y}} = \sqrt{\left((n-2)^{-1} \sum_{i=1}^n(y_i - \hat{y}_i)^2\right)\left(n^{-1} + \frac{(x^* - \overline{x})^2}{\sum_{i = 1}^n (x_i - \overline{x})^2} \right)} \]

where \(x^*\) is the radius (on the original scale) for which the confidence interval of \(E(y \mid x^*)\) is desired.

We can write a function to compute \(s_{\hat{y}}\):

s_yhat = function(x){
  ## x = radius on original scale
  n = nrow(mr)
  syy = sum(mr$resid^2)/(n-2)
  mean_logx <-mean(log10(mr$radius))
  sxx = sum((log10(mr$radius) - mean_logx)^2)
  out = sqrt(syy*(1/n + (log10(x)-mean_logx)^2/sxx))
  return(out)
}

s_yhat(3)
## [1] 0.01541483

Next we add the lower and upper bounds for 95% confidence interval to mr, then add them to our plot.

n = nrow(mr)
mr = mr %>%
  mutate(y_plus_se = pred + qt(.975, n-2)*s_yhat(radius),
         y_minus_se = pred - qt(.975, n-2)*s_yhat(radius))

ggplot(mr, aes(radius, mass)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(method="lm", se=TRUE, color="red")+
  geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue") +
  geom_line(aes(x = radius, y= 10^y_plus_se), color = "red", linetype="dashed") +
  geom_line(aes(x = radius, y= 10^y_minus_se), color = "red", linetype="dashed") +
  geom_vline(aes(xintercept = 10^(mean(log10(radius)))), color="blue", linetype="dotted") +
  theme(text = element_text(size=20)) 

Notice that our 95% confidence interval outlines the shaded region when we use geom_smooth(method="lm", se=TRUE)!

It is subtle in the plot, but confidence interval narrows as \(x^*\) gets closer to the mean of the \(\log10\)(radius).

Prediction Intervals for random \(y^*\)

The uncertainty in some random outcome \(y^*\) needs to account for the uncertainty in the estimated intercept and estimated slope, but also the uncertainty related to the randomness of the outcome (from the \(\varepsilon^*\) associated with \(y^*\)).

The formula for this uncertainty is \[ s_{\hat{y}^*} = \sqrt{\left((n-2)^{-1} \sum_{i=1}^n(y_i - \hat{y}_i)^2\right)\left(1 + n^{-1} + \frac{(x^* - \overline{x})^2}{\sum_{i = 1}^n (x_i - \overline{x})^2} \right)} \]

where \(x^*\) is the radius (on the original scale) that is associated with \(\hat{y}^*\).

We can write a function to compute \(s_{\hat{y}}\):

s_yhatstar = function(x){
  ## x = radius on original scale
  n = nrow(mr)
  syy = sum(mr$resid^2)/(n-2)
  mean_logx <-mean(log10(mr$radius))
  sxx = sum((log10(mr$radius) - mean_logx)^2)
  out = sqrt(syy*(1 + 1/n + (log10(x)-mean_logx)^2/sxx))
  return(out)
}

s_yhatstar(3)
## [1] 0.3302439

Next we add the lower and upper bounds for 95% prediction interval to mr, then add them to our plot

n = nrow(mr)
mr = mr %>%
  mutate(ystar_plus_se = pred + qt(.975, n-2)*s_yhatstar(radius),
         ystar_minus_se = pred - qt(.975, n-2)*s_yhatstar(radius))

ggplot(mr, aes(radius, mass)) +
  geom_point() +
  xlab("Radius (Earth Radius)") +
  ylab("Mass (Earth Mass)") +
  scale_x_log10() +
  scale_y_log10() +
  geom_line(aes(x = radius, y= 10^ystar_plus_se), color = "green", linetype="dashed", size= 1.5) +
  geom_line(aes(x = radius, y= 10^ystar_minus_se), color = "green", linetype="dashed", size= 1.5) +
  geom_smooth(method="lm", se=TRUE, color="red")+
  geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue") +
  geom_line(aes(x = radius, y= 10^y_plus_se), color = "red", linetype="dashed") +
  geom_line(aes(x = radius, y= 10^y_minus_se), color = "red", linetype="dashed") +
  geom_vline(aes(xintercept = 10^(mean(log10(radius)))), color="blue", linetype="dotted") +
  theme(text = element_text(size=20))

Notice the 95% prediction interval (green dashed lines) is wider than the 95% confidence interval.